Add support for Android Dalvik disassembly #90
Open
r0ny123 wants to merge 25 commits into
Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
… and handle unknown opcodes

- Remove incorrect len(dex_file.header) bounds check (lief.DEX.Header is not a buffer); replace with proper validation against raw_data size
- Stop assuming length=1 for unknown Dalvik opcodes, which would desync the instruction stream; instead log a warning and abort method disassembly
- Apply ruff formatting to DalvikFunctionAnalysisState.py
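As a rough illustration of the "log a warning and abort" behaviour described above, here is a minimal sketch. The `OPCODE_TABLE` excerpt and the `decode_method` name are hypothetical and not taken from the PR; only the idea (never guess a length for an unknown opcode) comes from the commit message.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical, heavily truncated table: opcode -> (mnemonic, length in 16-bit code units).
OPCODE_TABLE = {
    0x00: ("nop", 1),
    0x0E: ("return-void", 1),
    0x12: ("const/4", 1),
    0x14: ("const", 3),
}


def decode_method(bytecode: bytes):
    """Decode until the buffer ends or the first unknown opcode is hit.

    Returns (instructions, is_complete). Guessing a length for an unknown
    opcode would desynchronise everything after it, so we stop instead.
    """
    instructions = []
    offset = 0
    while offset + 2 <= len(bytecode):
        opcode = bytecode[offset]  # low byte of the little-endian code unit
        entry = OPCODE_TABLE.get(opcode)
        if entry is None:
            logger.warning("unknown opcode 0x%02x at offset %d, aborting method", opcode, offset)
            return instructions, False
        mnemonic, units = entry
        if offset + units * 2 > len(bytecode):
            logger.warning("truncated %s at offset %d, aborting method", mnemonic, offset)
            return instructions, False
        instructions.append((offset, mnemonic))
        offset += units * 2
    return instructions, True
```

A caller can use the second return value to mark the function as only partially decoded.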
The opcode table previously only covered 0x00-0x3D and 0x6E-0x78, causing the disassembler to abort on any method using common opcodes like array ops (aget/aput), field ops (iget/iput/sget/sput), arithmetic, type conversions, or literal operations.

Now covers all 209 defined opcodes per the Android Dalvik bytecode spec:

- Array operations (0x44-0x51)
- Instance field operations (0x52-0x5F)
- Static field operations (0x60-0x6D)
- Unary operations and type conversions (0x7B-0x8F)
- Binary 3-register operations (0x90-0xAF)
- Binary 2addr operations (0xB0-0xCF)
- Binary lit16/lit8 operations (0xD0-0xE2)
- invoke-polymorphic, invoke-custom, const-method-handle/type (0xFA-0xFF)

Ref: https://source.android.com/docs/core/runtime/dalvik-bytecode
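The array and field ranges listed above share one regular layout in the Dalvik spec (seven "get" variants followed by seven "put" variants), so they can be generated rather than typed out. The sketch below shows that idea; `build_family` and `OPCODES` are illustrative names, and the PR's actual table may well be written by hand.

```python
# Suffix order shared by the aget/aput, iget/iput and sget/sput families.
SUFFIXES = ["", "-wide", "-object", "-boolean", "-byte", "-char", "-short"]


def build_family(base_opcode: int, get_name: str, put_name: str) -> dict:
    """Map 14 consecutive opcodes to mnemonics: 7 getters, then 7 setters."""
    names = [get_name + s for s in SUFFIXES] + [put_name + s for s in SUFFIXES]
    return {base_opcode + i: name for i, name in enumerate(names)}


OPCODES = {}
OPCODES.update(build_family(0x44, "aget", "aput"))  # array ops, 0x44-0x51
OPCODES.update(build_family(0x52, "iget", "iput"))  # instance fields, 0x52-0x5F
OPCODES.update(build_family(0x60, "sget", "sput"))  # static fields, 0x60-0x6D
```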
1. Switch analyzeFunction from linear sweep to recursive traversal using the block_queue in DalvikFunctionAnalysisState. This prevents data payloads (packed-switch, sparse-switch, fill-array-data) from being misinterpreted as instructions and produces a correct CFG.
2. Include class name in invoke-* operand strings for disambiguation (e.g. 'Ljava/lang/Object;-><init>' instead of just '<init>').
3. Replace list() fallback with bytes() for LIEF DEX parsing, which is more efficient and idiomatic.
4. block_queue and related methods (chooseNextBlock, addBlockToQueue, hasUnprocessedBlocks, endBlock) are now actively used for the recursive traversal.
5. Remove dead identifyCallConflicts method from DalvikFunctionAnalysisState to reduce cognitive load.
6. Fix instruction addresses to be relative to bytecode_offset (code_item + 16) rather than the code_item start, which was causing all addresses to be 16 bytes too low.
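To make point 1 concrete, here is a toy sketch of queue-driven recursive traversal. The pre-decoded `PROGRAM` dictionary and the function name are hypothetical, and absolute byte offsets are used for simplicity; the real code works on raw bytecode through DalvikFunctionAnalysisState.

```python
from collections import deque

# Hypothetical pre-decoded view:
# offset -> (mnemonic, size_in_bytes, branch_targets, falls_through)
PROGRAM = {
    0: ("if-eqz", 4, [8], True),
    4: ("const/4", 2, [], True),
    6: ("return-void", 2, [], False),
    8: ("goto", 2, [4], False),
    # Payload bytes that a linear sweep would wrongly decode as an instruction.
    10: ("nop", 2, [], True),
}


def recursive_traversal(program, entry=0):
    """Return the sorted offsets reachable from entry via control flow only."""
    queue = deque([entry])
    visited = set()
    while queue:
        offset = queue.popleft()
        # Follow one block until it ends or merges into known code.
        while offset in program and offset not in visited:
            visited.add(offset)
            _mnemonic, size, targets, falls_through = program[offset]
            for target in targets:
                if target not in visited:
                    queue.append(target)
            if not falls_through:
                break
            offset += size
    return sorted(visited)
```

Because offset 10 is never the target of a branch or a fall-through, it is never decoded, which is the property that keeps switch and array payloads out of the CFG.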
LIEF's Python bindings for parse() interpret a bytes() object as a string filename, which was causing the fallback parsing intended for raw buffers to silently fail and return a null or empty object without resolving methods. Using list() correctly triggers the C++ binding for raw buffer parsing, allowing in-memory DEX analysis to work as intended.
- Fixed LIEF method.code_offset bug where code_offset points directly to the bytecode. We now safely backtrack 16 bytes to the code_item header to extract insns_size natively via struct unpacking, avoiding AttributeError since LIEF python bindings omit insns_size on CodeInfo.
- Added addPdbFile to DalvikDisassembler to fulfill the expected interface and prevent core Disassembler from crashing.
- Enabled native C++ fstream file loads when file_path is available, to avoid very slow parsing of Android APKs with 50,000+ methods via the in-memory list fallback.
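The 16-byte backtrack follows from the code_item layout in the DEX format: four u16 fields (registers_size, ins_size, outs_size, tries_size) followed by two u32 fields (debug_info_off, insns_size), then the instructions. A minimal sketch, assuming the raw DEX bytes are available and using a hypothetical helper name:

```python
import struct

CODE_ITEM_HEADER_SIZE = 16  # 4 x u16 + 2 x u32, little-endian


def read_insns_size(raw_dex: bytes, bytecode_offset: int) -> int:
    """Return insns_size (in 16-bit code units) for the bytecode at bytecode_offset."""
    header_offset = bytecode_offset - CODE_ITEM_HEADER_SIZE
    if header_offset < 0 or bytecode_offset > len(raw_dex):
        raise ValueError("bytecode offset outside buffer")
    (_registers_size, _ins_size, _outs_size, _tries_size,
     _debug_info_off, insns_size) = struct.unpack_from("<HHHHII", raw_dex, header_offset)
    # Reject sizes that would read past the end of the file.
    if bytecode_offset + insns_size * 2 > len(raw_dex):
        raise ValueError("insns_size exceeds buffer")
    return insns_size
```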
- Added DalvikDisassembler integration testing using a minimal XORed DEX binary.
- Fixed a silent parse failure in LIEF Python bindings where `bytes` objects passed directly were incorrectly parsed as filenames by the C++ engine. It now natively falls back on list(buffer) to forcefully utilize the memory buffer parsing path.
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Remove the redundant try/except fallback that attempted lief.DEX.parse(bytes) before falling through to list(). Since LIEF silently treats bytes as a filename string and returns None without raising TypeError, the fallback was unreliable. Using list() directly is the correct and only safe path for raw in-memory DEX parsing, matching LIEF's C++ raw buffer overload.

Also expand testBufferDisassembly assertions to validate function, instruction, and block counts now that disassembleUnmappedBuffer correctly returns results.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- P1-A: auto-detect DEX in disassembleBuffer() via magic-byte check so callers no longer need to pass architecture="dalvik" explicitly
- P1-B: guard SmdaInstruction.getDetailed() against non-Intel architectures to prevent a silent Capstone x86 decode of Dalvik bytecode
- P2-A: bound _getPayloadSize() to len(bytecode)-idx for all three payload types; guard element_width==0 in fill-array-data to prevent DoS on forged DEX headers
- P2-B: two-pass target validation: linear sweep builds valid_instruction_starts; switch-table and exception-handler targets are rejected if they don't land on a decoded instruction boundary; violations recorded in function_metadata
- P2-C: track decode_error_count / is_partial in DalvikFunctionAnalysisState and propagate both fields to function_metadata so partial functions are clearly signalled rather than silently appearing fully analysed
- P3-A: fix const/high16 and const-wide/high16 (format 21h) to use signed=True for the 16-bit immediate, matching baksmali's sign-extended output
- P3-B: add baksmali-style escape map in DexReferenceResolver to sanitise NUL bytes, newlines, tabs, and other control chars in DEX string literals
- P3-C: extend DexFileLoader._parseHeader() to accept ODEX (dey\n) and CDEX (cdex) magic bytes; apply ruff PIE810 fix (tuple startswith)
- Add 7 new tests covering all the above (23 total in testDalvikDisassembler.py)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
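For P2-A, the three payload layouts come from the Dalvik bytecode spec (ident 0x0100 packed-switch, 0x0200 sparse-switch, 0x0300 fill-array-data). The sketch below shows one way to bound the computed size by the remaining bytes. The function name is hypothetical, and returning 0 for an oversized payload is an assumption made here; the commit message does not say whether the PR clamps or rejects.

```python
import struct


def payload_size(bytecode: bytes, idx: int) -> int:
    """Return the payload size in bytes at idx, or 0 if it is invalid or does not fit."""
    remaining = len(bytecode) - idx
    if remaining < 4:
        return 0
    ident, count = struct.unpack_from("<HH", bytecode, idx)
    if ident == 0x0100:  # packed-switch: header (8 bytes) + count int32 targets
        size = 8 + count * 4
    elif ident == 0x0200:  # sparse-switch: header (4 bytes) + count keys + count targets
        size = 4 + count * 8
    elif ident == 0x0300:  # fill-array-data: header (8 bytes) + padded data
        element_width = count
        if remaining < 8 or element_width == 0:
            return 0
        (num_elements,) = struct.unpack_from("<I", bytecode, idx + 4)
        size = 8 + ((num_elements * element_width + 1) // 2) * 2
    else:
        return 0
    # A forged count must never make the caller skip or read past the buffer.
    return size if size <= remaining else 0
```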
ruff format --check was failing on Linux (LF line endings) due to:

- missing trailing blank line in Disassembler.py
- multiline raise collapsed to single line in SmdaInstruction.py
- dict literal key alignment and quote style in DalvikDisassembler.py
- quote style, trailing blank lines, inline comment spacing in tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Moved DexFileLoader import to the top of smda/Disassembler.py for consistency.
- Removed redundant 'as' aliases in smda/dalvik/__init__.py.

- Combined nested if statements in Disassembler.py (SIM102).
- Restored explicit re-exports in dalvik/__init__.py (F401).
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- SmdaInstruction.getDataRefs: yield both explicit data_refs_from entries and Intel operand-derived refs (deduped) instead of returning early after explicit refs.
- SmdaFunction._getCfgRoot: return None when offset has no entry block, so dominator/nesting are skipped rather than computed from a fabricated root.
- Disassembler._callbackAnalysisTimeout: log on 30s bucket transitions to remain reliable when callback timing skips past exact boundaries.
- DalvikDisassembler: prefer bytes for lief.DEX.parse (avoids ~30x memory blowup on large DEX), with fallback to list for older LIEF.
- DalvikDisassembler._buildValidInstructionStarts: advance by 2 bytes on resync (Dalvik instructions are 16-bit aligned).
- analyze.py: replace hand-rolled wrap with textwrap.wrap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
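The 2-byte resync in _buildValidInstructionStarts can be pictured with the following sketch. The `LENGTHS` excerpt and the function name are hypothetical stand-ins for the PR's opcode table and method; only the rule "skip one 16-bit code unit when decoding fails" is taken from the commit message.

```python
# Hypothetical excerpt: opcode -> instruction length in 16-bit code units.
LENGTHS = {0x00: 1, 0x0E: 1, 0x12: 1, 0x14: 3}


def build_valid_instruction_starts(bytecode: bytes) -> set:
    """Linear sweep that records every offset where an instruction decodes cleanly."""
    starts = set()
    offset = 0
    while offset + 2 <= len(bytecode):
        units = LENGTHS.get(bytecode[offset])
        if units is None or offset + units * 2 > len(bytecode):
            # Dalvik instructions are 16-bit aligned, so resync on the
            # next code unit instead of the next byte.
            offset += 2
            continue
        starts.add(offset)
        offset += units * 2
    return starts
```

A switch-table or exception-handler target can then be accepted only if it is a member of the returned set.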
- DexReferenceResolver: rename snake_case methods to camelCase to match the rest of the smda codebase (formatMethod, formatField, formatProto, formatTypeByIndex, formatRef, getMethod, getMethodTarget, getStringValue, getMethodMetadata, plus private helpers _indexItems, _safeGet, _safeAttr, _normalizeTypeString, _formatType, _formatProto). Module-level functions in DalvikOpcodeDecoder are left snake_case to match the existing style of StringExtractor/DominatorTree/CilDisassembler.
- Disassembler.disassembleBuffer: when DEX magic is autodetected on a Disassembler that was constructed with an explicit Intel backend, reset self.disassembler so initDisassembler creates a DalvikDisassembler instead of running Intel on DEX bytes.
- Add testDisassembleBufferIntelOnDexAutodetectsDalvik to pin the contract that magic-byte autodetect overrides explicit-Intel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
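The autodetect contract described above (DEX magic wins over an explicitly requested Intel backend) could look roughly like this. `looks_like_dex` and `choose_backend` are hypothetical names, and the backend strings are placeholders; the magic prefixes are the ones named in the commits (dex\n, dey\n, cdex).

```python
# Four-byte prefixes for DEX, ODEX and CDEX files.
DEX_MAGICS = (b"dex\n", b"dey\n", b"cdex")


def looks_like_dex(buffer: bytes) -> bool:
    """Cheap magic-byte check; bytes.startswith accepts a tuple of prefixes."""
    return bytes(buffer[:4]).startswith(DEX_MAGICS)


def choose_backend(buffer: bytes, requested: str = "intel") -> str:
    """Return "dalvik" whenever the buffer carries DEX magic, else the requested backend."""
    return "dalvik" if looks_like_dex(buffer) else requested
```

Letting the magic bytes override the caller's choice avoids decoding Dalvik bytecode as x86, at the cost of ignoring an explicit request in that one case.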
No description provided.